BANA 7042: Statistical Modeling

Lecture 2: Logistic Regression (Part I)

Brandon M. Greenwell, PhD

Western collaborative group study

\(N = 3154\) healthy young men aged 39–59 from the San Francisco area were assessed for their personality type. All were free from coronary heart disease at the start of the study. Eight and a half years later, any change in their coronary heart disease status was recorded.

Show R code
# install.packages("faraway")
head(wcgs <- faraway::wcgs)

Structure of wcgs data

Show R code
str(wcgs)
'data.frame':   3154 obs. of  13 variables:
 $ age    : int  49 42 42 41 59 44 44 40 43 42 ...
 $ height : int  73 70 69 68 70 72 72 71 72 70 ...
 $ weight : int  150 160 160 152 150 204 164 150 190 175 ...
 $ sdp    : int  110 154 110 124 144 150 130 138 146 132 ...
 $ dbp    : int  76 84 78 78 86 90 84 60 76 90 ...
 $ chol   : int  225 177 181 132 255 182 155 140 149 325 ...
 $ behave : Factor w/ 4 levels "A1","A2","B3",..: 2 2 3 4 3 4 4 2 3 2 ...
 $ cigs   : int  25 20 0 20 20 0 0 0 25 0 ...
 $ dibep  : Factor w/ 2 levels "A","B": 1 1 2 2 2 2 2 1 2 1 ...
 $ chd    : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
 $ typechd: Factor w/ 4 levels "angina","infdeath",..: 3 3 3 3 2 3 3 3 3 3 ...
 $ timechd: int  1664 3071 3071 3064 1885 3102 3074 3071 3064 1032 ...
 $ arcus  : Factor w/ 2 levels "absent","present": 1 2 1 1 2 1 1 1 1 2 ...

Description of each variable

Show R code
?wcgs
  • age: age in years
  • height: height in inches
  • weight: weight in pounds
  • sdp: systolic blood pressure in mm Hg
  • dbp: diastolic blood pressure in mm Hg
  • chol: fasting serum cholesterol in mm %
  • behave: behavior type which is a factor with levels A1 A2 B3 B4
  • cigs: number of cigarettes smoked per day
  • dibep: behavior type a factor with levels A (Aggressive) B (Passive)
  • chd: coronary heart disease developed is a factor with levels no yes
  • typechd: type of coronary heart disease is a factor with levels angina infdeath none silent
  • timechd: time of CHD event or end of follow-up
  • arcus: arcus senilis is a factor with levels absent present

For now, we’ll focus on 3 variables

Show R code
summary(wcgs[, c("chd", "height", "cigs")])
  chd           height           cigs     
 no :2897   Min.   :60.00   Min.   : 0.0  
 yes: 257   1st Qu.:68.00   1st Qu.: 0.0  
            Median :70.00   Median : 0.0  
            Mean   :69.78   Mean   :11.6  
            3rd Qu.:72.00   3rd Qu.:20.0  
            Max.   :78.00   Max.   :99.0  



Anything interesting stick out?

Visualizing discrete data

Pie chart of response 🤮

Show R code
# Construct a pie chart of the (binary) response; I'm not a fan of pie charts in general
ptab <- prop.table(table(wcgs$chd))  # convert frequencies to proportions
ptab  # inspect output

        no        yes 
0.91851617 0.08148383 
Show R code
pie(prop.table(table(wcgs$chd)),
    main = "Pie chart of Coronary Heart Disease")

Visualizing discrete data

Bar charts tend to be more effective

Show R code
barplot(ptab, las = 1, col = "forestgreen")

Visualizing discrete data

Mosaic plot showing relationship between cigs and chd; not incredibly useful IMO unless both variables are categorical

Show R code
plot(chd ~ cigs, data = wcgs)  # factor ~ numeric dispatches to spineplot()

Visualizing discrete data

Nonparametric density plot of height by chd status

Show R code
library(lattice)

densityplot(~ height, groups = chd, data = wcgs, auto.key = TRUE)

Visualizing discrete data

Boxplot of cigs vs. chd status

Show R code
plot(cigs ~ chd, data = wcgs)

Visualizing discrete data

  • Detour: decision trees are immensely useful tools for exploring new data sets

  • Useful resource: http://pages.stat.wisc.edu/~loh/treeprogs/guide/LohISI14.pdf

  • Ultimate reference: https://bgreenwell.github.io/treebook/ 😄

Visualizing discrete data

Standard CART-like decision tree

Show R code
rpart.plot::rpart.plot(rpart::rpart(chd ~ ., data = wcgs))

Visualizing discrete data

Conditional inference tree using only variables of interest

Show R code
plot(partykit::ctree(chd ~ height + cigs, data = wcgs))

Observations so far…

  • It seems that cigs is positively associated with the binary response chd
  • It is not clear how, if at all, height is associated with chd
  • Question: how can we build a model to examine these potential associations?

Linear models

Recall that in linear regression we model the conditional mean response as a linear function of some fixed but unknown parameters \(\boldsymbol{\beta}\):

\[ E\left(Y|\boldsymbol{x}\right) = \beta_0 + \beta_1x_1 + \dots + \beta_px_p = \boldsymbol{\beta}^\top\boldsymbol{x} \]

If \(Y\) is a binary random variable, then what is \(E\left(Y|\boldsymbol{x}\right)\)?

The linear probability (LP) model

It turns out that \(E\left(Y|\boldsymbol{x}\right) = P\left(Y = 1|\boldsymbol{x}\right)\). The LP model assumes that \[ P\left(Y = 1|\boldsymbol{x}\right) = \beta_0 + \beta_1x_1 + \dots + \beta_px_p = \boldsymbol{\beta}^\top\boldsymbol{x} \]

Is this reasonable?

The LP model: chd vs. cigs

Show R code
y <- ifelse(wcgs$chd == "yes", 1, 0)
summary(fit <- lm(y ~ cigs, data = wcgs))

Call:
lm(formula = y ~ cigs, data = wcgs)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25268 -0.09794 -0.05876 -0.05876  0.94124 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0587608  0.0062041   9.471  < 2e-16 ***
cigs        0.0019588  0.0003339   5.867 4.91e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2722 on 3152 degrees of freedom
Multiple R-squared:  0.0108,    Adjusted R-squared:  0.01049 
F-statistic: 34.42 on 1 and 3152 DF,  p-value: 4.91e-09

Are the standard errors here appropriate? Why/why not?
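They are not: for a binary response, \(\mathrm{Var}\left(Y|\boldsymbol{x}\right) = p\left(1 - p\right)\) depends on \(\boldsymbol{x}\), so the constant-variance assumption behind lm()'s standard errors fails. One common patch is a heteroskedasticity-robust (HC0 "sandwich") covariance estimate; below is a minimal base-R sketch on simulated data (a toy data set, not wcgs, so it stays self-contained):

```r
# Simulate a binary response whose variance depends on x, as in the LP model
set.seed(101)
n <- 500
x <- rpois(n, lambda = 8)
y <- rbinom(n, size = 1, prob = 0.05 + 0.002 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)
u <- residuals(fit)

# HC0 sandwich: (X'X)^{-1} X' diag(u^2) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat <- crossprod(X * u)  # equals X' diag(u^2) X
robust_se <- sqrt(diag(bread %*% meat %*% bread))

cbind(classical = sqrt(diag(vcov(fit))), robust = robust_se)
```

The same estimate is available pre-packaged via `sandwich::vcovHC()` together with `lmtest::coeftest()`, if those packages are installed.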

The LP model: chd vs. cigs

Show R code
plot(y ~ cigs, data = wcgs, ylim = c(0, 1), las = 1)
abline(fit, col = 2)


The logistic regression (LR) model

Assume \(Y \sim \mathrm{Bernoulli}\left(p\right)\), where \[ p = p\left(\boldsymbol{x}\right) = P\left(Y = 1|\boldsymbol{x}\right) \] and \[ \mathrm{logit}\left(p\right) = \log\left(\frac{p}{1-p}\right) = \boldsymbol{\beta}^\top\boldsymbol{x} \] In other words, LR models the logit of the mean response as a linear function in \(\boldsymbol{\beta}\); we’ll refer to the term \(\eta = \boldsymbol{\beta}^\top\boldsymbol{x}\) as the linear predictor. Why does this make more sense?

The logistic regression (LR) model

Can always solve for \(p\) to get predictions on the raw probability scale: \[ p\left(\boldsymbol{x}\right) = \frac{\exp\left(\boldsymbol{\beta}^\top\boldsymbol{x}\right)}{1 + \exp\left(\boldsymbol{\beta}^\top\boldsymbol{x}\right)} \]

Note how the LR model is nonlinear in \(p\)!
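A quick sanity check of that inverse transformation: for any value of the linear predictor \(\eta\), the result lands strictly inside \((0, 1)\), and it matches base R's built-in inverse logit, plogis():

```r
# Inverse logit: map any linear predictor eta to a probability in (0, 1)
eta <- seq(-5, 5, by = 0.5)
p <- exp(eta) / (1 + exp(eta))

all.equal(p, plogis(eta))  # TRUE; plogis() is R's inverse-logit function
range(p)                   # strictly between 0 and 1
```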

Fitting an LR model in R

  • Use the glm() function instead of lm()
  • GLM stands for generalized linear model, which includes the LR and ordinary linear regression models as special cases
  • Many (but not all) of the models we’ll discuss throughout this course belong to the class of GLMs
  • Note how we have to specify the family argument! (see ?glm for details)
  • The response can be a 0/1 indicator or a factor variable (but be careful with interpretation and which class is used as the baseline):
Show R code
levels(wcgs$chd)
[1] "no"  "yes"

Fitting an LR model in R

Show R code
summary(fit.lr <- glm(chd ~ cigs, data = wcgs, family = binomial))

Call:
glm(formula = chd ~ cigs, family = binomial, data = wcgs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9958  -0.4418  -0.3534  -0.3534   2.3684  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.742160   0.092111 -29.770  < 2e-16 ***
cigs         0.023220   0.004042   5.744 9.22e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1781.2  on 3153  degrees of freedom
Residual deviance: 1750.0  on 3152  degrees of freedom
AIC: 1754

Number of Fisher Scoring iterations: 5

Fitting an LR model in R

Show R code
prob <- predict(fit.lr, newdata = data.frame(cigs = 0:99), type = "response")
plot(y ~ cigs, data = wcgs, ylab = "chd", las = 1, xlim = c(0, 99))
lines(0:99, y = prob, col = 2)
abline(fit, col = 3, lty = 2)
legend("topright", legend = c("LR fit", "LP fit"), lty = c(1, 2), col = c(2, 3))

Fitting an LR model in R

Show R code
prob <- predict(fit.lr, newdata = data.frame(cigs = 0:999), type = "response")
plot(y ~ cigs, data = wcgs, ylab = "chd", las = 1, xlim = c(0, 999))
lines(0:999, y = prob, col = 2)
abline(fit, col = 3, lty = 2)
legend("topright", legend = c("LR fit", "LP fit"), lty = c(1, 2), col = c(2, 3))

Fitting an LR model in R

Show R code
summary(fit.lr2 <- glm(chd ~ cigs + height, data = wcgs, family = binomial))

Call:
glm(formula = chd ~ cigs + height, family = binomial, data = wcgs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0041  -0.4425  -0.3630  -0.3499   2.4357  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.50161    1.84186  -2.444   0.0145 *  
cigs         0.02313    0.00404   5.724 1.04e-08 ***
height       0.02521    0.02633   0.957   0.3383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1781.2  on 3153  degrees of freedom
Residual deviance: 1749.0  on 3151  degrees of freedom
AIC: 1755

Number of Fisher Scoring iterations: 5

Interpreting LR coefficients

  • Let \(p = P\left(Y = 1\right)\) and \(1 - p = P\left(Y = 0\right)\)
  • The odds of the event \(Y = 1\) occurring is defined as \(p / \left(1 - p\right)\)
  • For a fair coin, the probability of getting tails is \(p = 0.5\) and the probability of getting heads is \(1 - p = 0.5\). Therefore, the odds of getting tails vs. heads is \(p / (1 - p) = 0.5 / 0.5 = 1\). (We might also say the odds of getting tails is “1 to 1”).
  • For a fair 🎲, the probability of rolling a 2 is \(p = 1/6\) and the probability of not rolling a 2 is \(1 - p = 5/6\). Therefore, the odds of rolling a 2 vs. not rolling a 2 is \(\frac{p}{1 - p} = \frac{1/6}{5/6} = \frac{1}{5}\). (We might also say the odds of rolling a 2 is “1 to 5”).
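The two examples above are a one-liner in R:

```r
# Odds of an event with probability p
odds <- function(p) p / (1 - p)

odds(0.5)  # fair coin, tails vs. heads: 1 ("1 to 1")
odds(1/6)  # rolling a 2 vs. not: 0.2 ("1 to 5")
```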

Interpreting LR coefficients

The logit models the log odds of success (i.e., \(Y = 1|\boldsymbol{x}\)) \[ \log\left(\mathrm{odds}\right) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p \] Exponentiating, we get \[ \mathrm{odds} = \frac{p}{1-p} = \exp{\left(\beta_0\right)}\times\exp{\left(\beta_1x_1\right)}\times\exp{\left(\beta_2x_2\right)}\times\dots\times\exp{\left(\beta_px_p\right)} \]

  • In the LR model, \(\beta_i\) represents the change in the log odds when \(x_i\) increases by one unit (all else held constant)
  • In the LR model, \(\exp\left(\beta_i\right)\) represents the multiplicative increase in the odds when \(x_i\) increases by one unit (all else held constant)
  • CANNOT interpret the coefficients in terms of \(p\) directly…effect plots to the rescue!!
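To see the last point numerically, here is a small sketch using the intercept and cigs slope from the chd ~ cigs fit shown earlier: the odds ratio \(\exp\left(\beta_1\right)\) is constant, but the implied change in probability depends on where you start on the curve.

```r
# Coefficients taken from the earlier chd ~ cigs fit
b0 <- -2.742160
b1 <- 0.023220

exp(b1)  # odds ratio per additional cigarette: ~1.0235 (constant)

# The change in *probability* is not constant along the curve:
plogis(b0 + b1 * 20) - plogis(b0 + b1 * 0)   # going from 0 to 20 cigs/day
plogis(b0 + b1 * 60) - plogis(b0 + b1 * 40)  # going from 40 to 60 cigs/day
```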

WCGS study

Show R code
coef(fit.lr2)
(Intercept)        cigs      height 
-4.50161397  0.02312740  0.02520779 
  • Holding height constant, for every additional cigarette smoked per day the predicted log odds of developing chd increase by 0.023

  • Holding height constant, for every additional cigarette smoked per day the predicted odds of developing chd are multiplied by \(\exp\left(0.023\right) \approx 1.023\)

Effect plots

Lots of different methods and packages:

  • Marginal effects via the effects library
  • Partial dependence (PD) plots and individual conditional expectation (ICE) plots via the pdp package
  • Marginal effect and PD plots via the plotmo library
  • And many, many more

These generally produce similar shapes when the model is additive in nature (i.e., no interaction effects)

Effect plots

The plotmo library is an “easy button” for quick and rough effect plots (other variables are held fixed at their median value) and supports a wide range of models

Show R code
plotmo::plotmo(fit.lr2)
 plotmo grid:    cigs height
                    0     70

Effect plots

I generally prefer PD plots; see Greenwell (2017) for details

Show R code
library(pdp)
library(ggplot2)

theme_set(theme_bw())
partial(fit.lr2, pred.var = "cigs", prob = TRUE, plot = TRUE, 
        plot.engine = "ggplot2") + 
    ylim(0, 1) + 
    ylab("Probability")